1. Ingestion and Familiarization of the Data

This is an exploration of the tidy data set wineQualityReds.csv, provided by the Udacity Data Analyst Nanodegree for Project 3. I chose this data set because its small number of observations keeps the execution time of certain plots manageable. Guiding question: “Which variables affect red wine quality?”

Initial Explorations:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Some parameter names may be too long to plot with ggpairs, so renaming the parameters may be needed later. There are 1599 observations of red wine in the data set (rather small compared with other data sets). Alcohol content ranges from 8.40% to 14.90%, with a median of 10.2%. Quality ratings are mostly 5 or 6, with the median at 6. pH is fairly stable, ranging from 2.74 to 4.01 with a median of 3.31.
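
The output above looks like the result of names(), str(), and summary(). A minimal sketch of that first look, assuming the CSV sits in the working directory and is loaded into a data frame called wqr (the name used later in this report); the small fallback data frame is made up so the sketch runs without the file:

```r
path <- "wineQualityReds.csv"  # assumed location: the working directory

if (file.exists(path)) {
  wqr <- read.csv(path)
} else {
  # Tiny made-up stand-in so the sketch still runs without the file.
  wqr <- data.frame(X = 1:4,
                    alcohol = c(9.4, 9.8, 10.5, 11.2),
                    pH = c(3.51, 3.20, 3.26, 3.16),
                    quality = c(5L, 5L, 6L, 7L))
}

names(wqr)    # column names
str(wqr)      # column types and first few values
summary(wqr)  # min, quartiles, median, mean, max per column
```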

2. Free Exploration and Transformation

It seems that fixed acidity has little to do with the quality of red wines, and that there are far more wines of quality 5 to 7 than of any other rating; this is also noted in the readme wineQualityInfo.txt. Perhaps it would be better to combine the bottom two levels and the top two levels.

The histogram grid took some research to create (my first attempt, a function, did not work as intended). It provides a good overview of the distributions of the different chemical attributes. residual.sugar, chlorides, and perhaps sulphates could use a closer look with a rescaled X-axis, since they seem to be long-tailed. As an extra note, density seems to be approximately normally distributed.

From the scatter plots, sulphates and alcohol content seem to have some correlation with quality: higher alcohol content seems to indicate higher quality, but the variance is still pretty high.

The boxplots seem to reveal some interesting trends where the other plots showed relatively little. A re-scaling of the Y-axis for residual.sugar, chlorides, and sulphates might reveal a better view of the boxplots.

## 
##  Pearson's product-moment correlation
## 
## data:  quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and pH
## t = -37.3659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

A sanity check of fixed acidity vs pH shows that, as expected, there is a reasonably high correlation: -0.68. Looking at the correlation coefficients, it seems that alcohol, volatile acidity, sulphates, and citric acid have the four highest correlations with quality. pH, residual sugar, and free sulfur dioxide are seemingly uncorrelated with quality.
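
A ranking like the one described above can be produced in one pass rather than one cor.test() call per variable. The sketch below runs on a simulated stand-in data frame (demo, with made-up values), so its numbers say nothing about the wine data; only the pattern carries over:

```r
set.seed(3)
n <- 300
# Simulated stand-in for wqr: column names match, values are made up.
demo <- data.frame(alcohol = rnorm(n, 10.4, 1.1),
                   volatile.acidity = rnorm(n, 0.53, 0.18),
                   pH = rnorm(n, 3.31, 0.15))
demo$quality <- round(5.6 + 0.35 * (demo$alcohol - 10.4) -
                      1.0 * (demo$volatile.acidity - 0.53) + rnorm(n, sd = 0.7))

# Pearson correlation of every predictor with quality, sorted by magnitude.
predictors <- setdiff(names(demo), "quality")
cors <- sapply(demo[predictors], function(v) cor(v, demo$quality))
cors[order(abs(cors), decreasing = TRUE)]

# A single pair with its test and confidence interval.
with(demo, cor.test(quality, alcohol, method = "pearson"))
```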

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## [1] "(0.001,0.15]" "(0.15,0.3]"   "(0.3,0.5]"    "(0.5,1.2]"
## 
##  Pearson's product-moment correlation
## 
## data:  quality and as.numeric(cAcid.cut)
## t = 9.2631, df = 1465, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1862839 0.2829939
## sample estimates:
##      cor 
## 0.235221

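Created the categorical variable cAcid.cut from citric.acid by looking at the histogram (which is relatively evenly distributed) and then choosing breaks that maximize the correlation. I feel that more categorical variables could help simplify and visualize how each attribute may be contributing to the quality factor. The correlation with quality also increases with the new categorical variable. Note that the test above has df = 1465, so only 1467 of the 1599 wines fall into a bin: the lowest break excludes wines whose citric.acid is exactly 0, and those become NA.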
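
The lowest level printed above starts at 0.001, so wines with citric.acid equal to 0 land in no bin. A minimal sketch of the effect and one possible fix, using a short made-up vector (citric) in place of the real column:

```r
# Hypothetical stand-in values for citric.acid.
citric <- c(0, 0, 0.04, 0.26, 0.42, 0.56, 1.00)

# Breaks that start above zero leave the zero values unbinned (NA).
cut_dropping <- cut(citric, breaks = c(0.001, 0.15, 0.3, 0.5, 1.2))

# Starting at 0 and closing the lowest interval keeps them.
cut_keeping <- cut(citric, breaks = c(0, 0.15, 0.3, 0.5, 1.2),
                   include.lowest = TRUE)

sum(is.na(cut_dropping))  # zeros are lost here
sum(is.na(cut_keeping))   # nothing is lost here
levels(cut_keeping)
```

Whether the zero-citric-acid wines belong in the lowest bin or in a bin of their own is a modeling choice; the point is only that they should not silently disappear.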

Free and Total Sulfur Dioxide

## [1] "(0,59]"    "(59,109]"  "(109,300]"
## 
##  Pearson's product-moment correlation
## 
## data:  quality and as.numeric(total.sulf.dioxide.cut)
## t = -8.6666, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2582798 -0.1646315
## sample estimates:
##        cor 
## -0.2119421

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :  15.0   Min.   : 5.900   Min.   :0.1900   Min.   :0.0000  
##  1st Qu.: 447.0   1st Qu.: 6.600   1st Qu.:0.3775   1st Qu.:0.0450  
##  Median :1057.5   Median : 7.350   Median :0.6000   Median :0.2000  
##  Mean   : 901.2   Mean   : 7.883   Mean   :0.5178   Mean   :0.2072  
##  3rd Qu.:1296.8   3rd Qu.: 8.900   3rd Qu.:0.6300   3rd Qu.:0.3300  
##  Max.   :1559.0   Max.   :11.800   Max.   :0.7350   Max.   :0.4900  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.700   Min.   :0.04500   Min.   :50.0       
##  1st Qu.: 2.575   1st Qu.:0.07575   1st Qu.:51.0       
##  Median : 4.300   Median :0.10550   Median :52.5       
##  Mean   : 5.950   Mean   :0.12322   Mean   :56.0       
##  3rd Qu.: 7.600   3rd Qu.:0.16950   3rd Qu.:56.5       
##  Max.   :15.400   Max.   :0.23500   Max.   :72.0       
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   : 63.0        Min.   :0.9934   Min.   :3.160   Min.   :0.4400  
##  1st Qu.: 77.5        1st Qu.:0.9957   1st Qu.:3.200   1st Qu.:0.5300  
##  Median : 96.5        Median :0.9979   Median :3.290   Median :0.7200  
##  Mean   :103.7        Mean   :0.9979   Mean   :3.312   Mean   :0.6750  
##  3rd Qu.:124.0        3rd Qu.:0.9992   3rd Qu.:3.438   3rd Qu.:0.8075  
##  Max.   :160.0        Max.   :1.0037   Max.   :3.590   Max.   :0.9300  
##     alcohol          quality             cAcid.cut total.sulf.dioxide.cut
##  Min.   : 9.000   Min.   :5.000   (0.001,0.15]:3   (0,59]   : 0          
##  1st Qu.: 9.425   1st Qu.:5.000   (0.15,0.3]  :5   (59,109] :10          
##  Median : 9.500   Median :5.000   (0.3,0.5]   :7   (109,300]: 8          
##  Mean   : 9.978   Mean   :5.556   (0.5,1.2]   :0                         
##  3rd Qu.:10.275   3rd Qu.:6.000   NA's        :3                         
##  Max.   :12.900   Max.   :7.000
## 
##  Pearson's product-moment correlation
## 
## data:  quality and as.numeric(free.cut)
## t = -1.9126, df = 1597, p-value = 0.05598
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.096599360  0.001219402
## sample estimates:
##         cor 
## -0.04780459

The readme hinted that there might be adverse tastes when free sulfur dioxide values become greater than 50 ppm. There was an increase in correlation from the numerical version of the ‘free.sulfur.dioxide’ parameter to the categorical one; perhaps it will be useful in gaining insight into the quality parameter? I also turned the ‘total.sulfur.dioxide’ variable into a categorical variable (and saw an increase in correlation with ‘quality’).

Chlorides and Density

## [1] "(0,0.07]"     "(0.07,0.079]" "(0.079,0.09]" "(0.09,0.7]"
## 
##  Pearson's product-moment correlation
## 
## data:  quality and as.numeric(chloride.cut)
## t = -7.0399, df = 1597, p-value = 2.846e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2206370 -0.1255389
## sample estimates:
##        cor 
## -0.1734924

Both density and chlorides look approximately normally distributed and have reasonable correlations with quality. After resizing the x-axis, chlorides in particular looks very close to normal. Turning chlorides into a categorical variable increased its correlation with quality from -0.12 to -0.17.

## [1] "(0,0.996]"     "(0.996,0.997]" "(0.997,1.1]"
## 
##  Pearson's product-moment correlation
## 
## data:  quality and as.numeric(density.cut)
## t = -7.0216, df = 1597, p-value = 3.232e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2202076 -0.1250947
## sample estimates:
##        cor 
## -0.1730546

Both new categorical variables show an increased correlation with quality. Does this mean if a regression model is made, the categorical inputs will yield a better model?
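
One way to approach that question is to fit the same model twice, once with the numeric input and once with the binned one, and compare adjusted R-squared or AIC. Keep in mind that cor.test() on as.numeric() of a factor treats the bins as equally spaced scores, while lm() estimates a separate coefficient per bin, so the two measures need not agree. The sketch below uses simulated data (demo is made up), so it shows the mechanics only:

```r
set.seed(42)
n <- 500
# Simulated stand-in data: not the wine data set.
demo <- data.frame(chlorides = rlnorm(n, meanlog = log(0.08), sdlog = 0.3))
demo$quality <- 6 - 4 * demo$chlorides + rnorm(n, sd = 0.7)

# Bin into quartiles, in the spirit of chloride.cut.
demo$chloride.cut <- cut(demo$chlorides,
                         breaks = quantile(demo$chlorides, probs = seq(0, 1, 0.25)),
                         include.lowest = TRUE)

fit_numeric <- lm(quality ~ chlorides, data = demo)
fit_binned  <- lm(quality ~ chloride.cut, data = demo)

# Compare fit quality; both models use the same rows, so AIC is comparable.
c(numeric = summary(fit_numeric)$adj.r.squared,
  binned  = summary(fit_binned)$adj.r.squared)
AIC(fit_numeric, fit_binned)
```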

Rabbit Hole Idea

# Cut density into 12 equally populated bins at the 1/12, 2/12, ..., 12/12 quantiles
wqr$dcut <- cut(wqr$density,
  breaks = c(0, quantile(wqr$density, probs = (1:12) / 12)))

levels(wqr$dcut)
##  [1] "(0,0.9942]"      "(0.9942,0.9951]" "(0.9951,0.9956]"
##  [4] "(0.9956,0.996]"  "(0.996,0.9964]"  "(0.9964,0.9968]"
##  [7] "(0.9968,0.9971]" "(0.9971,0.9974]" "(0.9974,0.9978]"
## [10] "(0.9978,0.9984]" "(0.9984,0.9994]" "(0.9994,1.004]"
levels(wqr$dcut)<-c("1","2","3","4","5","6","6","5","4","3","2","1")

with(wqr, cor.test(quality, as.numeric(dcut), method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  quality and as.numeric(dcut)
## t = -6.8116, df = 1597, p-value = 1.363e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2152739 -0.1199932
## sample estimates:
##       cor 
## -0.168026

The thought was that symmetrically distributed variables like density and pH might have more to say despite their low correlation. The idea is simple: very high or very low density or pH will possibly produce ‘bad’ quality wines (in other words, extremes tend toward the negative outcome). As wine makers try to create ‘good’ wines, the independent variables should tend toward the middle of the distribution. By cutting the data into 12 equally populated parts and then combining the ends (high and low pairs), the new categorical variable might reveal a better correlation with the ‘quality’ factor. A slight drop in correlation was found :(.
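
The ‘folding’ step can be written once and reused for any variable. The helper below, fold_cut, is hypothetical (it is not part of the original analysis) and differs slightly from the code above: it starts the breaks at the minimum instead of 0 and closes the lowest interval. It assumes the quantile breaks are distinct, which may not hold for heavily tied data.

```r
# Hypothetical helper: cut x into n equally populated bins, then merge
# bin i with bin n + 1 - i so that both tails share a label.
fold_cut <- function(x, n = 12) {
  stopifnot(n %% 2 == 0)
  breaks <- quantile(x, probs = (0:n) / n, names = FALSE)
  bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  folded <- pmin(bins, n + 1 - bins)  # 1 = extremes, n / 2 = middle
  factor(folded, levels = seq_len(n / 2))
}

set.seed(1)
x <- rnorm(1200, mean = 3.3, sd = 0.15)  # simulated, pH-like values
folded <- fold_cut(x)
table(folded)
```

As in the code above, level 1 holds the lowest and highest twelfths and level 6 holds the two middle twelfths.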

Something I was looking for in the plots was a higher number of low quality rankings associated with category 1 of dcut (the combination of the highest and lowest density segments after cutting the samples into 12 evenly populated segments). The plots do not reveal anything significant, although they are very colorful. Perhaps pH will reveal something.

# Cut pH into 12 equally populated bins at the 1/12, 2/12, ..., 12/12 quantiles
wqr$pHcut <- cut(wqr$pH,
                 breaks = c(0, quantile(wqr$pH, probs = (1:12) / 12)))
levels(wqr$pHcut)
##  [1] "(0,3.1]"     "(3.1,3.16]"  "(3.16,3.21]" "(3.21,3.25]" "(3.25,3.28]"
##  [6] "(3.28,3.31]" "(3.31,3.34]" "(3.34,3.37]" "(3.37,3.4]"  "(3.4,3.45]" 
## [11] "(3.45,3.53]" "(3.53,4.01]"
levels(wqr$pHcut)<-c("1","2","3","4","5","6","6","5","4","3","2","1")

with(wqr, cor.test(quality, as.numeric(pHcut), method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  quality and as.numeric(pHcut)
## t = 0.422, df = 1597, p-value = 0.6731
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03848218  0.05954919
## sample estimates:
##        cor 
## 0.01055887
with(wqr, table(factor(quality), pHcut))
##    pHcut
##       1   2   3   4   5   6
##   3   2   3   0   3   0   2
##   4  12   6   8   7   6  14
##   5 104 133 117 107  95 125
##   6  97  98 125 102  85 131
##   7  33  24  38  35  32  37
##   8   4   5   2   5   2   0
with(subset(wqr, as.numeric(wqr$dcut)>=5), table(factor(quality), pHcut))
##    pHcut
##      1  2  3  4  5  6
##   3  1  0  0  0  0  1
##   4  1  3  3  3  2  6
##   5 30 50 60 48 37 53
##   6 23 24 37 27 35 50
##   7  6  5  9 11  5 11
##   8  0  3  0  0  1  0

Not looking good with pH. Maybe a combination of pH and density will reveal an unexpected correlation, where the combination shows some clustering of higher or lower quality.

The pH investigation is a dead end, though I still haven’t given up on the idea that there might be something in the normally distributed data. Maybe chlorides and residual sugar have some story to tell.

## [1] 1.964472
## [1] 3.202083
## [1] 1.848927
## [1] 2.078141
## [1] 1.881038
## [1] 1.677124
## [1] 0.004387833
## [1] 0.005805184
## [1] 0.002884486
## [1] 0.001565254
## [1] 0.0008676273
## [1] 0.0001363791
## [1] 0.02075111
## [1] 0.03292075
## [1] 0.02268592
## [1] 0.02371449
## [1] 0.02253024
## [1] 0.04025654
## [1] 4.007382e-06
## [1] 2.481157e-06
## [1] 2.523346e-06
## [1] 4.000036e-06
## [1] 4.733842e-06
## [1] 5.656195e-06
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Can people actually taste .1 g/liter of chloride/salt? Note: our chloride and residual sugar measurements are in g/dm^3, which is the same as g/liter. This might be a futile exercise. I looked at the variances of several variables across quality levels to see if there is a trend of increasing or decreasing variance. Such a trend could indicate that the tails of the distribution correlate with one particular quality level: once again, the ‘too much or too little yields bad quality’ theory.
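
Variances per quality level are easier to read as a labeled table than as a run of bare numbers. A minimal sketch with tapply(), on a simulated stand-in data frame (demo, made-up values, two example columns):

```r
set.seed(7)
# Simulated stand-in for wqr; column names follow the data set, values do not.
demo <- data.frame(
  quality   = sample(3:8, 600, replace = TRUE, prob = c(2, 5, 40, 38, 12, 3)),
  chlorides = rlnorm(600, meanlog = log(0.08), sdlog = 0.3),
  pH        = rnorm(600, mean = 3.3, sd = 0.15)
)

# One variance per quality level (rows) for each variable (columns).
var_by_quality <- sapply(demo[c("chlorides", "pH")],
                         function(v) tapply(v, demo$quality, var))
round(var_by_quality, 5)
```

A steady rise or fall down a column would be the kind of trend described above.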

# Cut chlorides into 12 equally populated bins at the 1/12, 2/12, ..., 12/12 quantiles
wqr$chlcut <- cut(wqr$chlorides,
  breaks = c(0, quantile(wqr$chlorides, probs = (1:12) / 12)))
levels(wqr$chlcut)
##  [1] "(0,0.059]"     "(0.059,0.066]" "(0.066,0.07]"  "(0.07,0.074]" 
##  [5] "(0.074,0.076]" "(0.076,0.079]" "(0.079,0.082]" "(0.082,0.085]"
##  [9] "(0.085,0.09]"  "(0.09,0.097]"  "(0.097,0.114]" "(0.114,0.611]"
levels(wqr$chlcut)<-c("1","2","3","4","5","6","6","5","4","3","2","1")

summary(wqr$chlorides, 10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
# Chlorides vs quality, colored by the folded chloride category
ggplot(aes(x=chlorides, y=quality, color=chlcut), data=wqr) + 
  geom_point(alpha = 1/2, position="jitter", size=3)

# Map chlcut to color so the Blues scale applies; the legend title names the variable
ggplot(aes(y=quality, x=chlcut, color=chlcut), data=wqr) + 
  geom_point(alpha = 1/2, position='jitter', size=3) +
  ggtitle("Despair Never Looked so... Colorful Why am I still doing this") +
  scale_color_brewer(type = 'seq', palette = 'Blues',
    guide = guide_legend(title = 'chlcut', reverse = FALSE,
    override.aes = list(alpha = 1, size = 1)))

with(wqr, table(factor(quality), chlcut))
##    chlcut
##       1   2   3   4   5   6
##   3   4   1   1   1   1   2
##   4  12   7  12  11   4   7
##   5  98 114 101 132  93 143
##   6 112 104 102 111  84 125
##   7  45  50  34  35  21  14
##   8   3   3   3   5   2   2

The Wikipedia article on “Taste” gives the average human detection threshold for sucrose as 10 millimoles per liter. With a molar mass of about 342 g/mol, that is roughly 3.4 g/liter, not .01 g/liter. Further searching turned up studies putting the salt detection threshold at around .5 mols/liter; for NaCl (about 58.4 g/mol) that would be roughly 29 g/liter rather than .5 grams/liter, so one of those two figures is likely off. Either way, with a range of .012-.611 g/liter and a third quartile of .09 g/liter, the differences in salt content between these wines are most likely not at levels that humans can readily distinguish. The differences in sugar should be detectable only in the sweeter wines, since even the third quartile of residual sugar (2.6 g/liter) is below the sucrose threshold. But once again, the ‘folding’ of the distribution has not revealed any new correlations.
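
Converting a threshold from moles per liter to grams per liter means multiplying by the molar mass, so the result depends on the substance. A small sketch of that arithmetic; the helper name mol_to_g_per_l is made up, and the molar masses are standard rounded values:

```r
# Molar masses in g/mol (standard values, rounded).
molar_mass <- c(sucrose = 342.3, NaCl = 58.44)

# Hypothetical helper: convert mol/liter to g/liter for a named substance.
mol_to_g_per_l <- function(mol_per_l, substance) {
  unname(mol_per_l * molar_mass[substance])
}

mol_to_g_per_l(0.010, "sucrose")  # 10 mmol/liter of sucrose, about 3.4 g/liter
mol_to_g_per_l(0.5, "NaCl")       # 0.5 mol/liter of NaCl, about 29 g/liter
```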

3. Further Exploration, Some Multivariate

Alcohol

Alcohol vs quality (the axes are a little backwards, but it seems more natural to place quality on the x-axis for visualization purposes). Added the median to show the possible linear relationship with quality.

## [1] "(2,4]" "(4,5]" "(5,6]" "(6,7]" "(7,8]"

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(qual.cut) and alcohol
## t = 22.0304, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4442698 0.5195032
## sample estimates:
##       cor 
## 0.4827767

Created a new categorical variable from ‘quality’ called ‘qual.cut’ by merging the bottom two categories of ‘quality’. I hope this makes trends a bit clearer and provides a cleaner view of the bins.
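
A minimal sketch of that merge, with breaks read off the levels printed above and a small made-up data frame (demo) standing in for wqr:

```r
# Hypothetical stand-in for wqr with only the two columns used here.
demo <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.8, 9.5, 9.9, 10.5, 11.0, 11.5, 12.8)
)

# Qualities 3 and 4 share the first bin; every other rating keeps its own.
demo$qual.cut <- cut(demo$quality, breaks = c(2, 4, 5, 6, 7, 8))

levels(demo$qual.cut)
# Median alcohol per merged quality bin.
tapply(demo$alcohol, demo$qual.cut, median)
```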

Volatile Acidity

The negative correlation between volatile acidity and citric acid can be seen in the above histogram: the higher the volatile acidity, the lower the citric acid. There also seems to be a reasonably strong correlation between total sulfur dioxide and quality.

Sulphates

Sulphates seem to be linearly correlated with quality, as shown by the ggpairs correlation matrix, so this is a good variable to add to a model.

Feeling somewhat unsatisfied with the current findings and the lack of elements to account for the variance in the quality ratings (although quality ratings are subjective and discrete), I started reading about how wine is rated. There are many subjective measures in rating wines, but the scheme that makes the most sense rates Appearance (visual), Aroma (smell), Taste, and Aftertaste (finish). Perhaps it’s best to see how good a model we can obtain from the current inputs.

Modeling the Data

## 
## Calls:
## m1: lm(formula = I(quality) ~ alcohol, data = wqr)
## m2: lm(formula = I(quality) ~ alcohol + sulphates, data = wqr)
## m3: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = wqr)
## m4: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut, data = wqr)
## m5: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut, data = wqr)
## m6: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut, data = wqr)
## m7: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut, 
##     data = wqr)
## m8: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH, data = wqr)
## m9: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH + fixed.acidity, data = wqr)
## m10: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH + fixed.acidity + residual.sugar, data = wqr)
## m11: lm(formula = I(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH + fixed.acidity + residual.sugar + free.sulfur.dioxide, 
##     data = wqr)
## 
## =======================================================================================================================================================
##                                              m1        m2        m3        m4        m5        m6        m7        m8        m9        m10       m11   
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                                1.875***  1.375***  2.611***  2.556***  2.665***  2.697***  2.750***  4.080***  3.732***  3.646***  3.704***
##                                           (0.175)   (0.177)   (0.196)   (0.218)   (0.217)   (0.230)   (0.247)   (0.492)   (0.673)   (0.676)   (0.676)  
## alcohol                                    0.361***  0.346***  0.309***  0.317***  0.301***  0.295***  0.290***  0.304***  0.300***  0.292***  0.295***
##                                           (0.017)   (0.016)   (0.016)   (0.017)   (0.017)   (0.018)   (0.020)   (0.021)   (0.021)   (0.022)   (0.022)  
## sulphates                                            0.994***  0.679***  0.630***  0.670***  0.705***  0.712***  0.686***  0.694***  0.710***  0.693***
##                                                     (0.102)   (0.101)   (0.105)   (0.104)   (0.106)   (0.106)   (0.106)   (0.107)   (0.107)   (0.108)  
## volatile.acidity                                              -1.221*** -1.170*** -1.067*** -1.024*** -1.012*** -1.026*** -1.031*** -1.035*** -1.009***
##                                                               (0.097)   (0.123)   (0.123)   (0.125)   (0.126)   (0.126)   (0.126)   (0.126)   (0.127)  
## cAcid.cut: (0.15,0.3]/(0.001,0.15]                                      -0.062    -0.009    -0.003     0.002    -0.030    -0.033    -0.032    -0.025   
##                                                                         (0.049)   (0.049)   (0.049)   (0.050)   (0.051)   (0.051)   (0.051)   (0.051)  
## cAcid.cut: (0.3,0.5]/(0.001,0.15]                                       -0.023     0.026     0.037     0.047    -0.014    -0.026    -0.028    -0.020   
##                                                                         (0.052)   (0.052)   (0.053)   (0.056)   (0.059)   (0.061)   (0.061)   (0.061)  
## cAcid.cut: (0.5,1.2]/(0.001,0.15]                                        0.030     0.055     0.091     0.106     0.010    -0.010    -0.013     0.000   
##                                                                         (0.065)   (0.065)   (0.066)   (0.071)   (0.077)   (0.081)   (0.081)   (0.082)  
## total.sulf.dioxide.cut: (59,109]/(0,59]                                           -0.112**  -0.106*   -0.107*   -0.098*   -0.092*   -0.098*   -0.142** 
##                                                                                   (0.043)   (0.043)   (0.043)   (0.043)   (0.043)   (0.044)   (0.050)  
## total.sulf.dioxide.cut: (109,300]/(0,59]                                          -0.396*** -0.389*** -0.390*** -0.399*** -0.384*** -0.401*** -0.468***
##                                                                                   (0.073)   (0.074)   (0.074)   (0.074)   (0.076)   (0.077)   (0.085)  
## chloride.cut: (0.07,0.079]/(0,0.07]                                                          0.012     0.016     0.018     0.016     0.016     0.015   
##                                                                                             (0.049)   (0.050)   (0.050)   (0.050)   (0.050)   (0.050)  
## chloride.cut: (0.079,0.09]/(0,0.07]                                                         -0.026    -0.019    -0.019    -0.021    -0.017    -0.018   
##                                                                                             (0.051)   (0.052)   (0.052)   (0.052)   (0.052)   (0.052)  
## chloride.cut: (0.09,0.7]/(0,0.07]                                                           -0.104*   -0.097    -0.119*   -0.118*   -0.120*   -0.121*  
##                                                                                             (0.053)   (0.054)   (0.054)   (0.054)   (0.054)   (0.054)  
## density.cut: (0.996,0.997]/(0,0.996]                                                                  -0.023    -0.015    -0.027    -0.039    -0.031   
##                                                                                                       (0.049)   (0.049)   (0.051)   (0.052)   (0.052)  
## density.cut: (0.997,1.1]/(0,0.996]                                                                    -0.032    -0.026    -0.055    -0.087    -0.075   
##                                                                                                       (0.054)   (0.053)   (0.066)   (0.069)   (0.069)  
## pH                                                                                                              -0.431**  -0.345    -0.316    -0.352   
##                                                                                                                 (0.138)   (0.179)   (0.180)   (0.181)  
## fixed.acidity                                                                                                              0.015     0.019     0.017   
##                                                                                                                           (0.020)   (0.020)   (0.020)  
## residual.sugar                                                                                                                       0.019     0.016   
##                                                                                                                                     (0.013)   (0.013)  
## free.sulfur.dioxide                                                                                                                            0.004   
##                                                                                                                                               (0.002)  
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                                     0.227     0.270     0.336     0.335     0.349     0.351     0.352     0.356     0.356     0.357     0.359
## adj. R-squared                                0.226     0.269     0.335     0.332     0.345     0.347     0.346     0.350     0.350     0.350     0.351
## sigma                                         0.710     0.690     0.659     0.658     0.651     0.650     0.651     0.649     0.649     0.649     0.648
## F                                           468.267   294.988   268.912   122.348    97.582    71.693    60.624    57.329    53.529    50.371    47.674
## p                                             0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
## Log-likelihood                            -1721.057 -1675.142 -1599.384 -1463.199 -1447.445 -1444.307 -1444.121 -1439.214 -1438.925 -1437.771 -1436.117
## Deviance                                    805.870   760.894   692.105   631.385   617.968   615.331   615.174   611.073   610.832   609.872   608.498
## AIC                                        3448.114  3358.284  3208.768  2942.398  2914.889  2914.614  2918.241  2910.428  2911.850  2911.541  2910.234
## BIC                                        3464.245  3379.793  3235.654  2984.726  2967.799  2983.397  2997.606  2995.084  3001.797  3006.779  3010.762
## N                                          1599      1599      1599      1467      1467      1467      1467      1467      1467      1467      1467    
## =======================================================================================================================================================

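I decided to run a linear regression to see how much of the variance in quality I can capture. The result is about 36% of the variance with all independent variables in the model (m11, R-squared 0.359). The same model was run with the original non-categorical variables; the resulting model was about 1-2% worse. Note that N drops from 1599 to 1467 from m4 onward, because cAcid.cut is NA for wines with zero citric acid, so m1-m3 and m4-m11 are fit to different samples and their log-likelihood, AIC, and BIC values are not directly comparable. There are several considerations: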

  1. ‘quality’ is essentially a categorical variable and should not be treated as a continuous one, even though it looks like one.
  2. A linear model might not be the best model; perhaps a non-linear approach could yield better results.
  3. Perhaps the data does not capture enough of the variance, as there are other subjective measures that wine experts take into account when rating wines, such as Appearance and Aroma.
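
One caveat on consideration 1 when moving to glm(): with a binomial family, a factor response is reduced to two classes, the first level versus all the others (this is documented on R’s ?family help page). With factor(quality) as the response, that means ‘lowest quality’ versus ‘everything else’, not an ordinal model. A minimal sketch on simulated data, where demo and its values are made up for illustration:

```r
set.seed(11)
n <- 400
# Simulated stand-in data, not the wine data set.
demo <- data.frame(alcohol = rnorm(n, mean = 10.4, sd = 1))
latent <- 0.8 * (demo$alcohol - 10.4) + rnorm(n)
demo$quality <- as.integer(cut(latent, breaks = c(-Inf, -1, 0, 1, Inf))) + 3L  # 4..7

# Factor response: glm treats the first level as failure, all others as success.
fit_factor <- glm(factor(quality) ~ alcohol,
                  family = binomial(link = "probit"), data = demo)

# Explicit two-class response built the same way.
demo$above_lowest <- demo$quality != min(demo$quality)
fit_binary <- glm(above_lowest ~ alcohol,
                  family = binomial(link = "probit"), data = demo)

# The two fits should have matching coefficients.
all.equal(coef(fit_factor), coef(fit_binary))
```

If an ordinal response is what is wanted, an ordered model such as polr() from the MASS package would be one alternative to explore.
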
## 
## Calls:
## m1: glm(formula = factor(quality) ~ alcohol, family = binomial(link = "probit"), 
##     data = wqr)
## m2: glm(formula = factor(quality) ~ alcohol + sulphates, family = binomial(link = "probit"), 
##     data = wqr)
## m3: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity, 
##     family = binomial(link = "probit"), data = wqr)
## m4: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut, family = binomial(link = "probit"), data = wqr)
## m5: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut, family = binomial(link = "probit"), 
##     data = wqr)
## m6: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut, family = binomial(link = "probit"), 
##     data = wqr)
## m7: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut, 
##     family = binomial(link = "probit"), data = wqr)
## m8: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH, family = binomial(link = "probit"), data = wqr)
## m9: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH + fixed.acidity, family = binomial(link = "probit"), data = wqr)
## m10: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH + fixed.acidity + residual.sugar, family = binomial(link = "probit"), 
##     data = wqr)
## m11: glm(formula = factor(quality) ~ alcohol + sulphates + volatile.acidity + 
##     cAcid.cut + total.sulf.dioxide.cut + chloride.cut + density.cut + 
##     pH + fixed.acidity + residual.sugar + free.sulfur.dioxide, 
##     family = binomial(link = "probit"), data = wqr)
## 
## ========================================================================================================================================================================================
##                                                m1           m2           m3           m4           m5           m6           m7           m8           m9          m10          m11     
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                                  0.597       -0.248       1.362          0.496        0.823        1.548       1.621       10.570        14.461       14.996      14.374    
##                                             (1.339)      (1.447)     (2.037)        (2.491)      (2.700)      (2.865)     (3.222)      (6.242)       (8.612)      (8.747)     (8.777)   
## alcohol                                      0.186        0.164       0.322          0.364        0.514*       0.524       0.606*       0.711*        0.779*       0.824*      0.885*   
##                                             (0.133)      (0.139)     (0.205)        (0.232)      (0.254)      (0.268)     (0.308)      (0.344)       (0.361)      (0.374)     (0.401)   
## sulphates                                                 1.757      -0.082          0.073       -0.806       -0.836      -1.478       -0.985        -0.867       -0.890      -0.469    
##                                                          (1.133)     (0.820)        (1.008)      (1.244)      (1.316)     (1.542)      (1.782)       (1.879)      (1.874)     (2.048)   
## volatile.acidity                                                     -3.057***      -2.626**     -4.380***    -4.807***   -5.618***    -5.913***     -5.869**     -5.778**    -5.839*** 
##                                                                      (0.673)        (0.809)      (1.220)      (1.350)     (1.665)      (1.780)       (1.785)      (1.764)     (1.762)   
## cAcid.cut: (0.15,0.3]/(0.001,0.15]                                                   4.066        4.725        4.913       4.992        4.561         4.656        4.593       4.761    
##                                                                                   (300.397)    (598.195)    (565.818)   (531.428)    (546.492)     (540.848)    (543.532)   (526.212)   
## cAcid.cut: (0.3,0.5]/(0.001,0.15]                                                   -0.197       -0.913       -1.050      -0.737       -1.329        -1.133       -1.158      -1.026    
##                                                                                     (0.369)      (0.520)      (0.562)     (0.591)      (0.722)       (0.773)      (0.774)     (0.777)   
## cAcid.cut: (0.5,1.2]/(0.001,0.15]                                                   -0.312       -0.758       -0.767      -0.382       -1.196        -0.862       -0.881      -0.807    
##                                                                                     (0.518)      (0.620)      (0.643)     (0.646)      (0.851)       (0.950)      (0.944)     (0.966)   
## total.sulf.dioxide.cut: (59,109]/(0,59]                                                           5.519        5.793       6.169        6.014         6.053        6.375       6.663    
##                                                                                                (635.708)    (609.846)   (554.753)    (568.242)     (555.182)    (528.901)   (522.764)   
## total.sulf.dioxide.cut: (109,300]/(0,59]                                                          6.677        7.192       7.833        7.756         7.544        7.395       7.868    
##                                                                                               (1068.682)   (1013.094)   (953.589)    (936.360)     (941.298)    (958.041)   (950.906)   
## chloride.cut: (0.07,0.079]/(0,0.07]                                                                           -0.673      -0.394       -0.317        -0.286       -0.245      -0.261    
##                                                                                                               (0.619)     (0.632)      (0.657)       (0.658)      (0.659)     (0.670)   
## chloride.cut: (0.079,0.09]/(0,0.07]                                                                           -0.524      -0.013       -0.040        -0.086       -0.059      -0.122    
##                                                                                                               (0.605)     (0.652)      (0.664)       (0.671)      (0.669)     (0.687)   
## chloride.cut: (0.09,0.7]/(0,0.07]                                                                             -0.497      -0.234       -0.238        -0.258       -0.264      -0.291    
##                                                                                                               (0.613)     (0.625)      (0.656)       (0.658)      (0.654)     (0.676)   
## density.cut: (0.996,0.997]/(0,0.996]                                                                                       0.448        0.195         0.246        0.282       0.383    
##                                                                                                                           (0.704)      (0.736)       (0.755)      (0.755)     (0.755)   
## density.cut: (0.997,1.1]/(0,0.996]                                                                                        -0.761       -0.842        -0.598       -0.459      -0.450    
##                                                                                                                           (0.637)      (0.714)       (0.819)      (0.855)     (0.859)   
## pH                                                                                                                                     -2.912        -3.920       -4.144      -4.106    
##                                                                                                                                        (1.686)       (2.303)      (2.377)     (2.454)   
## fixed.acidity                                                                                                                                        -0.178       -0.197      -0.204    
##                                                                                                                                                      (0.246)      (0.244)     (0.250)   
## residual.sugar                                                                                                                                                    -0.084      -0.095    
##                                                                                                                                                                   (0.177)     (0.178)   
## free.sulfur.dioxide                                                                                                                                                           -0.023    
##                                                                                                                                                                               (0.028)   
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Aldrich-Nelson R-sq.                           0.001        0.004        0.018        0.014        0.021        0.022        0.024        0.026        0.027        0.027        0.027  
## McFadden R-sq.                                 0.019        0.047        0.246        0.231        0.348        0.364        0.413        0.448        0.454        0.456        0.463  
## Cox-Snell R-sq.                                0.001        0.004        0.018        0.014        0.021        0.022        0.025        0.027        0.027        0.027        0.028  
## Nagelkerke R-sq.                               0.020        0.049        0.253        0.236        0.355        0.371        0.420        0.455        0.461        0.463        0.471  
## phi                                            1.000        1.000        1.000        1.000        1.000        1.000        1.000        1.000        1.000        1.000        1.000  
## Likelihood-ratio                               2.290        5.681       29.826       20.499       30.874       32.296       36.644       39.768       40.294       40.471       41.134  
## p                                              0.130        0.058        0.000        0.002        0.000        0.001        0.000        0.000        0.000        0.001        0.001  
## Log-likelihood                               -59.569      -57.873      -45.801      -34.149      -28.962      -28.251      -26.077      -24.515      -24.252      -24.163      -23.832  
## Deviance                                     119.139      115.747       91.602       68.298       57.924       56.502       52.153       49.030       48.503       48.326       47.663  
## AIC                                          123.139      121.747       99.602       82.298       75.924       80.502       80.153       79.030       80.503       82.326       83.663  
## BIC                                          133.893      137.878      121.111      119.335      123.542      143.994      154.227      158.395      165.159      172.273      178.901  
## N                                           1599         1599         1599         1467         1467         1467         1467         1467         1467         1467         1467      
## ========================================================================================================================================================================================

The generalized linear model, which treats the dependent variable ‘quality’ as a categorical variable, yields a better model, at least in the R-squared sense: the full model (m11) reaches a McFadden R-sq. of .463, which seems to indicate that the model is a good fit for the data (the non-categorical version stands at an R-sq. of .430). Pseudo R-squared values are not directly comparable with the R-squared of a linear regression, so the comparison is only a rough guide. I still wonder if a non-linear approach might be able to model the data more effectively.
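One caveat worth checking: according to the R documentation for `glm`, a binomial family given a factor response treats the first level as failure and all other levels as success, so with several quality levels the models above may effectively be separating the lowest score from the rest. An ordered probit is one alternative that keeps every level and its ordering. The sketch below uses `polr()` from the MASS package on simulated data; the data, thresholds and coefficients are invented for illustration, and this is not the model used in the analysis.

```r
library(MASS)  # provides polr(); part of the recommended packages shipped with R

# Simulated stand-in for the wine data (NOT the real observations).
set.seed(1)
n <- 600
sim <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.12, 1.2)
)
latent <- 0.9 * sim$alcohol - 2.5 * sim$volatile.acidity + rnorm(n)

# Cut the latent score into four ordered classes (hypothetical thresholds).
sim$quality <- cut(latent, breaks = quantile(latent, c(0, 0.1, 0.5, 0.9, 1)),
                   include.lowest = TRUE, labels = c("4", "5", "6", "7"),
                   ordered_result = TRUE)

# Ordered probit: one set of slopes plus one threshold between adjacent levels.
fit <- polr(quality ~ alcohol + volatile.acidity, data = sim,
            method = "probit", Hess = TRUE)

# McFadden pseudo R-squared. The intercept-only model reproduces the observed
# class shares, so its log-likelihood can be computed from the counts alone.
counts   <- table(sim$quality)
ll_null  <- sum(counts * log(counts / sum(counts)))
mcfadden <- 1 - as.numeric(logLik(fit)) / ll_null
print(round(mcfadden, 3))
```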

Sulfur Dioxide and Sulphate Exploration

##       cor 
## 0.6676665
##        cor 
## 0.05165757
##      cor 
## 0.667776
## 
## Calls:
## modFTS: lm(formula = free.sulfur.dioxide ~ total.sulfur.dioxide + sulphates + 
##     residual.sugar, data = wqr)
## 
## ===============================
## (Intercept)            4.230***
##                       (0.871)  
## total.sulfur.dioxide   0.209***
##                       (0.006)  
## sulphates              1.432   
##                       (1.148)  
## residual.sugar         0.399** 
##                       (0.141)  
## -------------------------------
## R-squared                 0.449
## adj. R-squared            0.448
## sigma                     7.771
## F                       433.388
## p                         0.000
## Log-likelihood        -5545.516
## Deviance              96325.376
## AIC                   11101.032
## BIC                   11127.918
## N                      1599    
## ===============================
##          (Intercept) total.sulfur.dioxide            sulphates 
##            4.2303610            0.2085175            1.4315258 
##       residual.sugar 
##            0.3990285
## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and sulfcomb
## t = 35.8786, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6400010 0.6943446
## sample estimates:
##       cor 
## 0.6680626

Surprisingly, a linear combination of ‘sulphates’ and ‘total.sulfur.dioxide’ did not improve the correlation with ‘free.sulfur.dioxide’, which stays at about 0.668. This was another dead end where hypothesis and testing yielded a negative result, and I was ready to move on at this point.
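One way to see why the combination could not help much: for a least-squares fit with an intercept, the correlation between the response and the fitted linear combination equals the square root of R-squared. It can never fall below the best single-predictor correlation, and it rises only as far as the extra predictors add information. The sketch below checks that identity on the first ten rows shown in the `str()` output at the top of this report (ten rows only, so the numbers are purely illustrative), then applies it to the R-squared reported for `modFTS`. The object names are my own.

```r
# First ten observations, copied from the str() output (illustration only).
first10 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80)
)

fit <- lm(free.sulfur.dioxide ~ total.sulfur.dioxide + sulphates, data = first10)

# Correlation with the best linear combination vs. with one predictor alone.
r_combo  <- cor(first10$free.sulfur.dioxide, fitted(fit))
r_single <- cor(first10$free.sulfur.dioxide, first10$total.sulfur.dioxide)

# The identity: multiple correlation equals sqrt(R-squared).
print(all.equal(r_combo, sqrt(summary(fit)$r.squared)))

# Applied to the R-squared of 0.449 reported for modFTS above.
print(sqrt(0.449))
```

With ‘sulphates’ almost uncorrelated with ‘free.sulfur.dioxide’ (about 0.05), there is very little extra information for the combination to pick up, which is consistent with the negative result.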

Bonus Observations ‘X’

A tip from a friend led me to the following investigation: is there any relationship between the observation number and the quality of the wine? Perhaps the experts grew tired, so that observations with larger X, which would have been scored later in the process if the original order was preserved, were rated differently.

Looking at the plot, it seems that, with the exception of wines with quality 3, all scores are evenly distributed across the observation numbers. The medians tend towards the 800 mark, as expected in a data set of 1599 observations. One can be reasonably assured that no bias was introduced through the ordering of the observations, assuming the original order was preserved.
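The visual check can be backed by a couple of rank-based summaries. A minimal sketch, using simulated scores drawn independently of `X` (the class shares are hypothetical and this is not the wine data), so it shows what "no ordering bias" looks like rather than reproducing the result:

```r
# Simulated stand-in: 1599 observations, scores unrelated to X by construction.
set.seed(7)
sim <- data.frame(X = 1:1599)
sim$quality <- sample(3:8, nrow(sim), replace = TRUE,
                      prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Median observation number per score; values near 800 suggest no drift.
med_by_quality <- tapply(sim$X, sim$quality, median)

# Rank-based checks that do not assume quality is continuous.
rho <- cor.test(sim$X, sim$quality, method = "spearman", exact = FALSE)
kw  <- kruskal.test(X ~ factor(quality), data = sim)

print(med_by_quality)
print(c(spearman_rho = unname(rho$estimate), kruskal_p = kw$p.value))
```

On the real data, the same calls would take the wine data frame in place of `sim`. Scores with very few observations, such as quality 3, will have noisy medians either way.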

Principal Component Analysis

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7604 1.3878 1.2452 1.1015 0.97943 0.81216 0.76406
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion  0.2817 0.4568 0.5978 0.7081 0.79528 0.85525 0.90832
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.65035 0.58706 0.42583 0.24405
## Proportion of Variance 0.03845 0.03133 0.01648 0.00541
## Cumulative Proportion  0.94677 0.97810 0.99459 1.00000
## Standard deviations:
##  [1] 1.7604353 1.3877715 1.2452082 1.1014684 0.9794346 0.8121627 0.7640623
##  [8] 0.6503512 0.5870623 0.4258323 0.2440457
## 
## Rotation:
##                              PC1          PC2         PC3          PC4
## fixed.acidity         0.48931422 -0.110502738  0.12330157 -0.229617370
## volatile.acidity     -0.23858436  0.274930480  0.44996253  0.078959783
## citric.acid           0.46363166 -0.151791356 -0.23824707 -0.079418256
## residual.sugar        0.14610715  0.272080238 -0.10128338 -0.372792562
## chlorides             0.21224658  0.148051555  0.09261383  0.666194756
## free.sulfur.dioxide  -0.03615752  0.513566812 -0.42879287 -0.043537818
## total.sulfur.dioxide  0.02357485  0.569486959 -0.32241450 -0.034577115
## density               0.39535301  0.233575490  0.33887135 -0.174499758
## pH                   -0.43851962  0.006710793 -0.05769735 -0.003787746
## sulphates             0.24292133 -0.037553916 -0.27978615  0.550872362
## alcohol              -0.11323206 -0.386180959 -0.47167322 -0.122181088
##                              PC5         PC6         PC7         PC8
## fixed.acidity         0.08261366 -0.10147858  0.35022736 -0.17759545
## volatile.acidity     -0.21873452 -0.41144893  0.53373510 -0.07877531
## citric.acid           0.05857268 -0.06959338 -0.10549701 -0.37751558
## residual.sugar       -0.73214429 -0.04915555 -0.29066341  0.29984469
## chlorides            -0.24650090 -0.30433857 -0.37041337 -0.35700936
## free.sulfur.dioxide   0.15915198  0.01400021  0.11659611 -0.20478050
## total.sulfur.dioxide  0.22246456 -0.13630755  0.09366237  0.01903597
## density              -0.15707671  0.39115230  0.17048116 -0.23922267
## pH                   -0.26752977  0.52211645  0.02513762 -0.56139075
## sulphates            -0.22596222  0.38126343  0.44746911  0.37460432
## alcohol              -0.35068141 -0.36164504  0.32765090 -0.21762556
##                               PC9        PC10         PC11
## fixed.acidity         0.194020908  0.24952314  0.639691452
## volatile.acidity     -0.129110301 -0.36592473  0.002388597
## citric.acid          -0.381449669 -0.62167708 -0.070910304
## residual.sugar        0.007522949 -0.09287208  0.184029964
## chlorides             0.111338666  0.21767112  0.053065322
## free.sulfur.dioxide   0.635405218 -0.24848326 -0.051420865
## total.sulfur.dioxide -0.592115893  0.37075027  0.068701598
## density               0.020718675  0.23999012 -0.567331898
## pH                   -0.167745886  0.01096960  0.340710903
## sulphates            -0.058367062 -0.11232046  0.069555381
## alcohol               0.037603106  0.30301450 -0.314525906

Throughout this analysis, I felt that I might be missing some hidden connections between the variables. Then I remembered that principal component analysis is a powerful way to reveal hidden structure within a data set. It can help identify how variables work together, and may reduce the dimensionality of the data. Principal component 1 (PC1) contains only 28% of the total variance within the 11 possibly correlated ‘independent’ variables. As for dimensionality reduction, PC1-PC8 capture 94.7% of the variance and PC1-PC9 capture 97.8%, so a cutoff of 95% variance captured requires nine components, a reduction of only 2 dimensions.
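The 95% cutoff can be checked directly from the standard deviations printed above. In this minimal sketch the `sdev` vector is copied from that output and the other names are my own. The squared values sum to roughly 11, which suggests the variables were scaled before the PCA, though that is an inference from the output.

```r
# Standard deviations of the 11 principal components, copied from the output.
sdev <- c(1.7604353, 1.3877715, 1.2452082, 1.1014684, 0.9794346, 0.8121627,
          0.7640623, 0.6503512, 0.5870623, 0.4258323, 0.2440457)

# Each component's variance is its squared standard deviation.
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches the cutoff.
cutoff   <- 0.95
n_needed <- which(cum_var >= cutoff)[1]

print(round(cum_var, 4))
print(n_needed)
```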

Looking at the breakdown of the first two principal components: PC1 is made up mainly of the variables related to acidity and pH, and PC2 is primarily related to sulfur dioxide. PC2 is interesting because it is in keeping with the findings of the sulphate/sulfur dioxide exploration above: free and total sulfur dioxide are both major components of PC2, but sulphates contribute almost nothing to it. PC3 is the component where alcohol carries the largest weight.

The results of the principal component analysis reveal that there is no single dominant component. This is in keeping with the linear regression, where it took a majority of the variables to get the best scores. I think I will use PCA much earlier in my exploratory data analysis in the future.
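The reading of the components can be made mechanical by ranking the loadings by absolute value. A small sketch: the numbers are the PC1-PC3 columns of the rotation matrix printed above, while `top_loadings` is a hypothetical helper written for this illustration.

```r
vars <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
          "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
          "density", "pH", "sulphates", "alcohol")

# PC1-PC3 loadings, copied from the rotation output.
rotation <- cbind(
  PC1 = c(0.48931422, -0.23858436, 0.46363166, 0.14610715, 0.21224658,
          -0.03615752, 0.02357485, 0.39535301, -0.43851962, 0.24292133,
          -0.11323206),
  PC2 = c(-0.110502738, 0.274930480, -0.151791356, 0.272080238, 0.148051555,
          0.513566812, 0.569486959, 0.233575490, 0.006710793, -0.037553916,
          -0.386180959),
  PC3 = c(0.12330157, 0.44996253, -0.23824707, -0.10128338, 0.09261383,
          -0.42879287, -0.32241450, 0.33887135, -0.05769735, -0.27978615,
          -0.47167322)
)
rownames(rotation) <- vars

# Hypothetical helper: the k loadings with the largest absolute weight.
top_loadings <- function(rot, pc, k = 3) {
  w <- rot[, pc]
  w[order(abs(w), decreasing = TRUE)][seq_len(k)]
}

top <- lapply(colnames(rotation), function(pc) top_loadings(rotation, pc))
names(top) <- colnames(rotation)
print(top)

# Squared loadings within a component sum to 1, so this is the share of PC2
# attributable to sulphates.
print(rotation["sulphates", "PC2"]^2)
```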

The graphical representation of the results of the principal component analysis allows for a quick overview of the top contributors to individual principal components, as well as the proportions of variance and the cumulative variance. The dotted lines in the PC Weights Breakdown graph make it easier to trace each line.

Line graphs tend to suggest a trend, which is not the case with PCA weights, so a bar graph is perhaps more apt. I chose the line graph previously because I felt it showed the information with more clarity, but displaying all 11 PCs together is perhaps too cluttered and not altogether useful. A set of the most important PCs with their weight distributions is probably more useful.
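As a sketch of the bar-graph alternative restricted to the leading components, the block below runs `prcomp()` on `USArrests`, a data set bundled with R that stands in for the wine data here, and draws grouped bars of the PC1 and PC2 weights with base graphics. The choice of two components and the plot labels are assumptions for illustration.

```r
# USArrests ships with R and is used only as a stand-in for the wine data.
pca <- prcomp(USArrests, scale. = TRUE)  # center and scale before the PCA

# Loadings of the first two components: one row per PC, one column per variable.
weights <- t(pca$rotation[, 1:2])

pdf(NULL)  # null device so the sketch runs without a display; drop to see the plot
mids <- barplot(weights, beside = TRUE, legend.text = TRUE, las = 2,
                ylab = "Loading", main = "PC weights by variable")
abline(h = 0)  # reference line separating positive and negative weights
invisible(dev.off())
```

Grouping the bars by variable keeps the sign and size of each weight readable without implying any trend from one variable to the next.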

4: Final Plots and Summary

After the investigation of the data set and numerous dead ends, the two variables that seem to have the strongest effect on red wine quality are Volatile Acidity and Alcohol.

Plot One

Volatile Acidity seems to be one of the primary contributors to quality; the box plot helps reveal the roughly linear relationship between this input variable and the dependent variable. An added observation in this visualization is that Citric Acid is negatively correlated with Volatile Acidity.
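The two quantities behind this plot are the per-score medians of volatile acidity (the centre lines of the box plot) and the correlation between the two acids. A minimal sketch on the first ten rows shown in the `str()` output at the top of this report; ten rows are far too few to reproduce the finding, so this only illustrates the calls:

```r
# First ten observations, copied from the str() output (illustration only).
first10 <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median volatile acidity per quality score.
med_va <- tapply(first10$volatile.acidity, first10$quality, median)

# Pearson correlation between citric acid and volatile acidity.
r_acid <- cor(first10$citric.acid, first10$volatile.acidity)

print(med_va)
print(r_acid)
```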

Plot Two

Principal Component Analysis

The graphical representation of the principal component analysis (PCA) helps to quickly visualize the distribution of the total variance in the data set. The bar graph showing the variance proportion of each PC, and the color-coordinated line graph showing the relative contribution of each variable to the respective PCs, give a quick way to find hidden relationships. The main relationships are acidity, sulfur dioxide, and alcohol, corresponding to PC1, PC2 and PC3. It takes nine principal components to capture more than 95% of the total variance.

Plot Three

The final low-hanging insight is that alcohol is linearly correlated with quality and should be a significant contributor if one were to build a predictive model of red wine quality from the given variables. It seems that, at least for red wine, higher alcohol content means a better quality score! The chart is also colored by the observation number ‘X’ to demonstrate that there is no bias in the inherent ordering of the observations (for example, observations numbered 1-200 do not tend to have higher quality ratings than observations 1400-1600).

5: Reflections

Exploring the Red Wine data set has been somewhat frustrating. After the lessons with the Diamonds and Facebook data sets, where intuition paid off at least two times out of three, the red wine data refused to give up anything that was not somewhat obvious. Working with a qualitative, unbalanced sampling of the dependent variable was somewhat challenging as well. I keep thinking that discarding the bottom 2 and the top levels of the quality factor would vastly improve correlations, at the risk of making the modeling trivial and stripping the data of its meaning.

The variables that affect Red Wine Quality most are ‘alcohol’, ‘volatile.acidity’ and ‘sulphates’. ‘citric.acid’, though reasonably well correlated with quality, is highly correlated with ‘volatile.acidity’ and loses much of its impact in the linear modeling because of that relationship. Certain variables that I had hoped would show better correlation through informed factoring did not perform as expected (‘free.sulfur.dioxide’); instead, ‘total.sulfur.dioxide’ gave better correlation after factoring.

The exploration of the red wine data set took more time than I expected. For the future, perhaps a non-linear model could be implemented and tested. I also think I will use Principal Component Analysis earlier in the exploration to quickly see if there are hidden, or even more obvious, relationships between variables. R is truly a powerful data analysis tool.